ABSTRACT
The Web presents itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because they require substantial human intervention and yield low-quality extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.