A fast and robust method for web page template detection and removal
Source
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Detection and evidence
Pages: 258 - 267
Year of Publication: 2006
ISBN: 1-59593-433-2
Authors
Karane Vieira, Universidade Federal do Amazonas, Manaus, AM, Brazil
Altigran S. da Silva, Universidade Federal do Amazonas, Manaus, AM, Brazil
Nick Pinto, Universidade Federal do Amazonas, Manaus, AM, Brazil
Edleno S. de Moura, Universidade Federal do Amazonas, Manaus, AM, Brazil
João M. B. Cavalcanti, Universidade Federal do Amazonas, Manaus, AM, Brazil
Juliana Freire, University of Utah, Salt Lake City, UT
Bibliometrics
Downloads (6 Weeks): 8, Downloads (12 Months): 103, Citation Count: 5
ABSTRACT
The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates, obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.
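The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how a template is represented, so the sketch assumes, purely for illustration, that a template is the set of text lines recurring in most sample pages. The function names, the `min_support` threshold, and the sample pages are all hypothetical. The `f_measure` helper uses the standard harmonic mean of precision and recall.

```python
from collections import Counter


def detect_template(sample_pages, min_support=0.8):
    """Step 1 (assumed simplification): treat lines that recur in at least
    min_support of the sample pages as template content."""
    counts = Counter()
    for page in sample_pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_support * len(sample_pages)
    return {block for block, n in counts.items() if n >= threshold}


def remove_template(page, template):
    """Step 2: drop template lines from a page; detection is not repeated."""
    kept = [line.strip() for line in page.splitlines()
            if line.strip() and line.strip() not in template]
    return "\n".join(kept)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of terms."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample pages sharing a header and a footer.
sample = [
    "Home | News | Contact\nRain expected in Manaus\n(c) Example Site",
    "Home | News | Contact\nLocal team wins final\n(c) Example Site",
    "Home | News | Contact\nNew library opens\n(c) Example Site",
]
template = detect_template(sample)

# The derived template is applied to a page outside the sample.
new_page = "Home | News | Contact\nElection results announced\n(c) Example Site"
cleaned = remove_template(new_page, template)
print(cleaned)
```

The costly part (`detect_template`) runs once over the small sample, while `remove_template` is a cheap set lookup per line, which mirrors the performance argument made in the abstract.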
CITED BY 5
David Fernandes, Edleno S. de Moura, Berthier Ribeiro-Neto, Altigran S. da Silva, Marcos André Gonçalves, Computing block importance for searching on web sites, Proceedings of the sixteenth ACM conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Karane Vieira, André Luiz Costa Carvalho, Klessius Berlt, Edleno S. Moura, Altigran S. Silva, Juliana Freire, On Finding Templates on Web Collections, World Wide Web, v.12 n.2, p.171-211, June 2009