Template extraction from candidate template set generation: a structure and content approach
Source ACM Southeast Regional Conference archive
Proceedings of the 43rd annual Southeast regional conference - Volume 2
Kennesaw, Georgia
SESSION: Software design, languages and systems
Pages: 211 - 216  
Year of Publication: 2005
ISBN:1-59593-059-0
Authors
Hang Su  Vanderbilt University, Nashville, TN
Qiaozhu Mei  University of Illinois at Urbana-Champaign, Urbana, IL
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 23,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1167253.1167303

ABSTRACT

This paper introduces a new approach to webpage template extraction. Unlike traditional methods, which consider only content information, it considers both structural and content similarity. It uses natural table structures as content units instead of text blocks or pagelets. The paper formally defines templates and related concepts, and introduces a new concept, the candidate template, which is an intermediate level of abstract table structure. A candidate template covers only the most informative tables and abstracts a large set of pages with similar structures. The paper proposes a novel approach to template extraction that solves three subproblems surrounding the candidate template set. Introducing the candidate template set addresses the accuracy and efficiency problems of traditional approaches. The paper also introduces a new model for structural similarity, and one for table informativeness based on six heuristics.

