Automatic web news extraction using tree edit distance
Source: International World Wide Web Conference archive
Proceedings of the 13th International Conference on World Wide Web
New York, NY, USA
SESSION: Mining new media
Pages: 502 - 511
Year of Publication: 2004
ISBN: 1-58113-844-X
Authors
D. C. Reis  Federal University of Minas Gerais, Belo Horizonte, Brazil
P. B. Golgher  Akwan Information Technologies, Belo Horizonte, Brazil
A. S. Silva  Federal University of Amazonas, Manaus, Brazil
A. F. Laender  Federal University of Minas Gerais, Belo Horizonte, Brazil
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 37,   Downloads (12 Months): 225,   Citation Count: 37
DOI: http://doi.acm.org/10.1145/988672.988740

ABSTRACT

The Web is the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, they are still not widely used, mostly because they require substantial human intervention and produce low-quality extraction results.

In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
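The abstract does not spell out the tree analysis itself, so the following is only an illustrative sketch of the general idea named in the title: measuring how far apart two page trees are. It uses the restricted top-down model in the spirit of Selkow (1977), where roots are always mapped to each other and insertions and deletions apply to whole subtrees. It is not the algorithm of this paper. The names `Node`, `size`, and `top_down_distance`, the unit costs, and the two toy pages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled ordered tree node, e.g. an HTML element (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Selkow-style distance with unit costs (an assumption).

    The two roots are always mapped (cost 1 if labels differ). Their child
    sequences are then aligned like strings: deleting or inserting a child
    costs the size of its whole subtree, and matching two children costs
    their recursive distance.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy pages that share a template; page_b has one extra paragraph.
page_a = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("div", [Node("a")]),
])])
page_b = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p"), Node("p")]),
    Node("div", [Node("a")]),
])])

print(top_down_distance(page_a, page_b))
```

Under these assumptions, pages generated from the same template end up a small distance apart, which is the kind of signal a structure-based extractor could use to group pages before locating the news content.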


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

2
L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo. Automatic annotation of data extracted from large Web sites. In Proceedings of the International Workshop on the Web and Databases, pages 7--12, San Diego, USA, 2003.

16
A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, USA, June 2002.

17
S. M. Selkow. The tree-to-tree editing problem. Information Processing Letters, 6:184--186, Dec. 1977.

19
G. Valiente. An efficient bottom-up distance between trees. In Proceedings of the 8th International Symposium on String Processing and Information Retrieval, pages 212--219, Santiago, Chile, 2001. IEEE Computer Science Press.

20
G. Valiente. Tree edit distance and common subtrees. Research Report LSI-02-20-R, Universitat Politècnica de Catalunya, Barcelona, Spain, 2002.

22
J. T. L. Wang and K. Zhang. Finding similar consensus between trees: an algorithm and a distance hierarchy. Pattern Recognition, 34:127--137, 2001.

CITED BY  38
