Template detection for large scale search engines
Source: Symposium on Applied Computing
Proceedings of the 2006 ACM Symposium on Applied Computing
Dijon, France
SESSION: Information access and retrieval (IAR)
Pages: 1094 - 1098  
Year of Publication: 2006
ISBN:1-59593-108-2
Authors
Liang Chen, Tsinghua University, Beijing, P.R. China
Shaozhi Ye, University of California, Davis, CA
Xing Li, Tsinghua University, Beijing, P.R. China
Sponsor
SIGAPP: ACM Special Interest Group on Applied Computing
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1141277.1141534

ABSTRACT

Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from slow processing and poor scalability when dealing with large volumes of web pages. In this paper, we propose a novel two-stage template detection method, which combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks, and the blocks are clustered according to their style features. Second, similar contents sharing a common layout style are detected during the index building process. Blocks with similar layout style and content are identified as templates and deleted. Our experiment on eight popular web sites shows that our method is 20-40% faster than the shingle and SST methods, with comparable accuracy.
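
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the style features, the similarity measure, or how detection is tied into index building, so everything here is an assumption made for illustration. The `Block` type, the `style` tuple, the `min_pages` threshold, and the use of exact matching on normalized text (in place of a real content-similarity test) are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    page_id: str
    style: tuple  # hypothetical layout-style features, e.g. (tag, css_class)
    text: str


def normalize(text: str) -> str:
    # Assumed stand-in for content similarity: case- and whitespace-insensitive match.
    return " ".join(text.lower().split())


def detect_templates(blocks, min_pages=2):
    # Stage 1: cluster blocks that share the same layout style.
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.style].append(block)

    # Stage 2: inside each style cluster, find content repeated across pages.
    templates = set()
    for style, members in clusters.items():
        pages_by_content = defaultdict(set)
        for block in members:
            pages_by_content[normalize(block.text)].add(block.page_id)
        for content, pages in pages_by_content.items():
            if len(pages) >= min_pages:  # assumed threshold, not from the paper
                templates.add((style, content))
    return templates


def strip_templates(blocks, templates):
    # Keep only blocks whose (style, content) pair was not flagged as a template.
    return [b for b in blocks if (b.style, normalize(b.text)) not in templates]


blocks = [
    Block("p1", ("div", "nav"), "Home | About"),
    Block("p1", ("div", "main"), "Article one"),
    Block("p2", ("div", "nav"), "home |  about"),
    Block("p2", ("div", "main"), "Article two"),
]
templates = detect_templates(blocks)
kept = strip_templates(blocks, templates)
```

In this toy input, the navigation block shares both style and normalized content across two pages and is flagged, while the two main-content blocks share a style but differ in content and are kept. A real system would presumably replace the exact-match step with a fuzzier comparison performed while the index is built.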



