| Template detection for large scale search engines |
| Source
|
Symposium on Applied Computing
Proceedings of the 2006 ACM symposium on Applied computing
Dijon, France
SESSION: Information access and retrieval (IAR)
Pages: 1094 - 1098
Year of Publication: 2006
ISBN:1-59593-108-2
|
|
Authors
Liang Chen, Tsinghua University, Beijing, P.R. China
Shaozhi Ye, University of California, Davis, CA
Xing Li, Tsinghua University, Beijing, P.R. China
ABSTRACT
Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from slow processing and poor scalability when dealing with large volumes of web pages. In this paper, we propose a novel two-stage template detection method that combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks, and the blocks are clustered according to their style features. Second, similar contents sharing a common layout style are detected during the index building process. Blocks with similar layout style and content are identified as templates and deleted. Our experiment on eight popular web sites shows that our method runs 20-40% faster than the shingle and SST methods with comparable accuracy.
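The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` type, the `style` signature string, the `min_pages` threshold, and the function names are all hypothetical, and exact matching of normalized text stands in for the paper's notion of "similar" content, which the abstract does not specify. Page segmentation and the index building process itself are assumed to happen elsewhere.

```python
import hashlib
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Set, Tuple


class Block(NamedTuple):
    """A page block after segmentation (segmentation itself is not shown)."""
    page_id: str
    style: str  # hypothetical layout signature, e.g. a tag path plus CSS class
    text: str


def fingerprint(text: str) -> str:
    """Hash of case-folded, whitespace-normalized text (exact-match stand-in)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def detect_templates(blocks: Iterable[Block], min_pages: int = 2) -> Set[Tuple[str, str]]:
    """Return (style, content fingerprint) pairs that recur across pages."""
    # Stage 1: cluster blocks by their style signature.
    by_style = defaultdict(list)
    for block in blocks:
        by_style[block.style].append(block)

    # Stage 2: within each style cluster, find content shared by several pages.
    templates = set()
    for style, group in by_style.items():
        pages_by_content = defaultdict(set)
        for block in group:
            pages_by_content[fingerprint(block.text)].add(block.page_id)
        for content_hash, pages in pages_by_content.items():
            if len(pages) >= min_pages:  # assumed threshold, not from the paper
                templates.add((style, content_hash))
    return templates


def strip_templates(blocks: Iterable[Block], templates: Set[Tuple[str, str]]) -> List[Block]:
    """Drop every block whose style and content match a detected template."""
    return [b for b in blocks if (b.style, fingerprint(b.text)) not in templates]


if __name__ == "__main__":
    sample = [
        Block("p1", "div.nav", "Home | About"),
        Block("p1", "div.main", "Article one"),
        Block("p2", "div.nav", "home |  about"),
        Block("p2", "div.main", "Article two"),
    ]
    found = detect_templates(sample)
    for block in strip_templates(sample, found):
        print(block.page_id, block.text)
```

Grouping by style first keeps the content comparison confined to blocks that already share a layout, which is the property the abstract credits for the method's speed; a real system would replace the exact-match fingerprint with a similarity measure.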
CITED BY 6
Karane Vieira, André Luiz Costa Carvalho, Klessius Berlt, Edleno S. Moura, Altigran S. Silva, Juliana Freire, On Finding Templates on Web Collections, World Wide Web, v.12 n.2, p.171-211, June 2009