Automatic extraction of informative blocks from webpages
Source: Symposium on Applied Computing
Proceedings of the 2005 ACM Symposium on Applied Computing
Santa Fe, New Mexico
SESSION: Web technologies and applications (WTA)
Pages: 1722 - 1726  
Year of Publication: 2005
ISBN:1-58113-964-0
Authors
Sandip Debnath  The Pennsylvania State University, PA
Prasenjit Mitra  The Pennsylvania State University, PA
C. Lee Giles  The Pennsylvania State University, PA
Sponsor
SIGAPP: ACM Special Interest Group on Applied Computing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1066677.1067065

ABSTRACT

Search engines crawl and index webpages according to their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must separate the "primary content blocks" from the other blocks. In this paper, two new algorithms, ContentExtractor and FeatureExtractor, are proposed. The algorithms identify primary content blocks by (i) looking for blocks that do not occur a large number of times across webpages and (ii) looking for blocks with desired features, respectively. They identify the primary content blocks with high precision and recall, reduce the storage requirement for search engines, and yield smaller indexes, and thereby faster search times and better user satisfaction. On several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
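
The first idea in the abstract, keeping blocks that do not recur across many webpages, can be illustrated with a short sketch. This is not the authors' ContentExtractor implementation, whose details are not given here. It assumes that each page has already been split into text blocks, that two blocks count as the same when they match exactly after whitespace and case normalization, and that a block is treated as non-primary when it appears on more than half of the pages. The names `normalize`, `primary_blocks`, and `max_page_fraction` are hypothetical.

```python
from collections import Counter


def normalize(block: str) -> str:
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(block.lower().split())


def primary_blocks(pages: list[list[str]], max_page_fraction: float = 0.5) -> list[list[str]]:
    """Keep, per page, the blocks that appear on at most max_page_fraction of all pages.

    Assumption: exact match after normalization stands in for block similarity.
    """
    page_counts = Counter()
    for blocks in pages:
        # Count each distinct block once per page, however often it repeats there.
        page_counts.update({normalize(b) for b in blocks})
    limit = max_page_fraction * len(pages)
    return [[b for b in blocks if page_counts[normalize(b)] <= limit] for blocks in pages]


# Made-up example pages from a single hypothetical news site.
pages = [
    ["Home | World | Sports", "Storm hits the coast overnight.", "Copyright 2005 Example News"],
    ["Home | World | Sports", "Local team wins the final.", "Copyright 2005 Example News"],
    ["Home | World | Sports", "Markets close higher.", "Copyright 2005 Example News"],
]
result = primary_blocks(pages)
```

In this made-up input the navigation bar and the copyright notice appear on all three pages and are dropped, while each article block appears on one page and is kept. A realistic system would need a tolerant similarity measure instead of exact matching, since boilerplate blocks often vary slightly from page to page.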


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

[1]
[2]
[3]
[4] C. Hsu. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. In AAAI-98 Workshop on AI and Information Integration, pages 66-73. AAAI Press, 1998.
[5]
[6] Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In International Joint Conference on Artificial Intelligence (IJCAI), pages 729-737, 1997.
[7]
[8]
[9]
[10]

