A brief survey of web data extraction tools
Source: ACM SIGMOD Record
Volume 31, Issue 2 (June 2002)
Column: Surveys
Pages: 84-93
Year of Publication: 2002
ISSN: 0163-5808
Authors
Alberto H. F. Laender, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Berthier A. Ribeiro-Neto, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Altigran S. da Silva, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Juliana S. Teixeira, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/565117.565137

ABSTRACT

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.

