Page-level template detection via isotonic smoothing
Source
International World Wide Web Conference
Proceedings of the 16th international conference on World Wide Web
Banff, Alberta, Canada
SESSION: Data mining
Pages: 61 - 70
Year of Publication: 2007
ISBN: 978-1-59593-654-7
ABSTRACT
We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
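To make the smoothing idea concrete, here is a minimal sketch of isotonic regression on a tree. It is not the paper's algorithm. It assumes (the abstract does not say this) that templateness is monotone down the DOM tree, so a child's smoothed score may not be lower than its parent's, and it leaves out the regularization term because the abstract does not specify it. The function name `smooth_scores` and the dictionary-based tree representation are hypothetical. The sketch minimizes the total absolute deviation from the raw scores, relying on the fact that an L1 isotonic fit always has an optimal solution whose values are drawn from the input scores.

```python
from typing import Dict, List


def smooth_scores(children: Dict[int, List[int]],
                  scores: Dict[int, float],
                  root: int) -> Dict[int, float]:
    """Unregularized L1 isotonic regression on a rooted tree.

    Minimizes sum(|x[v] - scores[v]|) subject to x[child] >= x[parent].
    The monotonicity direction is an assumption of this sketch.
    """
    # Candidate values: an optimal L1 fit exists using only input scores.
    levels = sorted(set(scores.values()))
    k = len(levels)

    # Pre-order traversal (parents before children).
    order: List[int] = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    cost: Dict[int, List[float]] = {}   # cost[v][i]: best subtree cost if x[v] = levels[i]
    best_at_or_above: Dict[int, List[int]] = {}  # argmin of cost[v][j] over j >= i

    # Bottom-up pass: children are processed before their parents.
    for v in reversed(order):
        c = [abs(scores[v] - lv) for lv in levels]
        for ch in children.get(v, []):
            arg = best_at_or_above[ch]
            for i in range(k):
                c[i] += cost[ch][arg[i]]
        cost[v] = c

        arg = [0] * k
        best = k - 1
        for i in range(k - 1, -1, -1):
            if c[i] <= c[best]:
                best = i
            arg[i] = best
        best_at_or_above[v] = arg

    # Top-down pass: pick the root's best level, then each child's best
    # level that is not below its parent's level.
    chosen = {root: best_at_or_above[root][0]}
    smoothed: Dict[int, float] = {}
    for v in order:
        i = chosen[v]
        smoothed[v] = levels[i]
        for ch in children.get(v, []):
            chosen[ch] = best_at_or_above[ch][i]
    return smoothed


# Hypothetical four-node DOM tree with noisy per-node classifier scores.
example_tree = {0: [1, 2], 1: [3]}
example_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.3}
example_smoothed = smooth_scores(example_tree, example_scores, root=0)
```

The dynamic program runs in time proportional to the number of nodes times the number of distinct scores. A regularized variant, such as the one the abstract mentions, would add a penalty term to the objective; its exact form is not given here, so it is not modeled.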