A fast and robust method for web page template detection and removal
Source: Conference on Information and Knowledge Management
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Detection and evidence
Pages: 258-267
Year of Publication: 2006
ISBN: 1-59593-433-2
Authors
Karane Vieira  Universidade Federal do Amazonas, Manaus, AM, Brazil
Altigran S. da Silva  Universidade Federal do Amazonas, Manaus, AM, Brazil
Nick Pinto  Universidade Federal do Amazonas, Manaus, AM, Brazil
Edleno S. de Moura  Universidade Federal do Amazonas, Manaus, AM, Brazil
João M. B. Cavalcanti  Universidade Federal do Amazonas, Manaus, AM, Brazil
Juliana Freire  University of Utah, Salt Lake City, UT
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183614.1183654

ABSTRACT

The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates, obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.
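
The two-step flow described in the abstract (detect on a small sample, then remove from the rest) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not say how detection works, so the sketch swaps in a much simpler stand-in that treats any term appearing in every sample page as a template term. All function names (`tokenize`, `detect_template_terms`, `remove_template`, `f_measure`), the `min_fraction` parameter, and the toy pages are assumptions made for illustration only.

```python
from collections import Counter
import re


def tokenize(page: str) -> list[str]:
    """Drop HTML tags, lowercase the text, and split it into word tokens."""
    text = re.sub(r"<[^>]+>", " ", page)
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_template_terms(sample_pages: list[str], min_fraction: float = 1.0) -> set[str]:
    """Step 1 (stand-in): run the costly detection on a small sample only.

    Assumption: a term is a template term if it occurs in at least
    `min_fraction` of the sample pages. The paper's own detector may differ.
    """
    doc_freq = Counter()
    for page in sample_pages:
        doc_freq.update(set(tokenize(page)))  # count each term once per page
    threshold = min_fraction * len(sample_pages)
    return {term for term, count in doc_freq.items() if count >= threshold}


def remove_template(page: str, template_terms: set[str]) -> list[str]:
    """Step 2: cheaply strip the already-derived template from another page."""
    return [t for t in tokenize(page) if t not in template_terms]


def f_measure(predicted: set[str], actual: set[str]) -> float:
    """Harmonic mean of precision and recall over sets of template terms."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy collection: a shared navigation bar plus page-specific text.
sample = [
    "<div>Home About Contact</div><p>Cheap flights to Manaus</p>",
    "<div>Home About Contact</div><p>Hotel deals in Arlington</p>",
    "<div>Home About Contact</div><p>Car rental offers</p>",
]
template = detect_template_terms(sample)

# A page outside the sample is cleaned without re-running detection.
rest = "<div>Home About Contact</div><p>Train tickets to Utah</p>"
content = remove_template(rest, template)
```

The point of the split is that `detect_template_terms` sees only the sample, while `remove_template` is a single pass per remaining page. The `f_measure` helper shows how detected terms could be scored against a hand-labelled set of template terms, which is the kind of term-level evaluation the abstract reports.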



