ABSTRACT
In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.
CITED BY 66
Pável P. Calado , Marcos A. Gonçalves , Edward A. Fox , Berthier Ribeiro-Neto , Alberto H. F. Laender , Altigran S. da Silva , Davi C. Reis , Pablo A. Roberto , Monique V. Vieira , Juliano P. Lage, The Web-DL environment for building digital libraries from the Web, Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, May 27-31, 2003, Houston, Texas
Alberto H. F. Laender , Altigran S. da Silva , Paolo B. Golgher , Berthier Ribeiro-Neto , Irna M. R. Evangelista-Filha , Karine V. Magalhães, The Debye Environment for Web Data Management, IEEE Internet Computing, v.6 n.4, p.60-69, July 2002
D. C. Reis , P. B. Golgher , A. S. Silva , A. F. Laender, Automatic web news extraction using tree edit distance, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
Yasuhiro Yamada , Nick Craswell , Tetsuya Nakatoh , Sachio Hirokawa, Testbed for information extraction from deep web, Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, May 19-21, 2004, New York, NY, USA
Jun Fujima , Aran Lunzer , Kasper Hornbæk , Yuzuru Tanaka, Clip, connect, clone: combining application elements to build custom interfaces for information access, Proceedings of the 17th annual ACM symposium on User interface software and technology, October 24-27, 2004, Santa Fe, NM, USA
Hongkun Zhao , Weiyi Meng , Zonghuan Wu , Vijay Raghavan , Clement Yu, Fully automatic wrapper generation for search engines, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
Boanerges Aleman-Meza , Meenakshi Nagarajan , Cartic Ramakrishnan , Li Ding , Pranam Kolari , Amit P. Sheth , I. Budak Arpinar , Anupam Joshi , Tim Finin, Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection, Proceedings of the 15th international conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland
Eli Cortez , Altigran S. da Silva , Marcos André Gonçalves , Filipe Mesquita , Edleno S. de Moura, FLUX-CIM: flexible unsupervised extraction of citation metadata, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
Kai Simon , Georg Lausen , Harold Boley, From HTML documents to web tables and rules, Proceedings of the 8th international conference on Electronic commerce: The new e-commerce: innovations for conquering current barriers, obstacles and limitations to conducting successful business on the internet, August 13-16, 2006, Fredericton, New Brunswick, Canada
Boanerges Aleman-Meza , Meenakshi Nagarajan , Li Ding , Amit Sheth , I. Budak Arpinar , Anupam Joshi , Tim Finin, Scalable semantic analytics on social networks for addressing the problem of conflict of interest detection, ACM Transactions on the Web (TWEB), v.2 n.1, p.1-29, February 2008
Karla A. V. Borges , Alberto H. F. Laender , Claudia B. Medeiros , Clodoveu A. Davis, Jr., Discovering geographic locations in web pages using urban addresses, Proceedings of the 4th ACM workshop on Geographical information retrieval, November 09-09, 2007, Lisbon, Portugal
Alberto Pan , Juan Raposo , Manuel Álvarez , Víctor Carneiro , Fernando Bellas, Automatically maintaining navigation sequences for querying semi-structured web sources, Data & Knowledge Engineering, v.63 n.3, p.795-810, December, 2007
Bettina Fazzinga , Sergio Flesca , Andrea Tagarelli , Salvatore Garruzzo , Elio Masciari, A wrapper generation system for PDF documents, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
Shuyi Zheng , Ruihua Song , Ji-Rong Wen , Di Wu, Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
Manuel Álvarez , Alberto Pan , Juan Raposo , Fernando Bellas , Fidel Cacheda, Extracting lists of data records from semi-structured web pages, Data & Knowledge Engineering, v.64 n.2, p.491-509, February, 2008
Shuyi Zheng , Matthew R. Scott , Ruihua Song , Ji-Rong Wen, Pictor: an interactive system for importing data from a website, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
Gengxin Miao , Junichi Tatemura , Wang-Pin Hsiung , Arsany Sawires , Louise E. Moser, Extracting data records from the web using tag path clustering, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain
Shuyi Zheng , Ruihua Song , Ji-Rong Wen, Template-independent news extraction based on visual consistency, Proceedings of the 22nd national conference on Artificial intelligence, p.1507-1512, July 22-26, 2007, Vancouver, British Columbia, Canada
Michael Toomim , Steven M. Drucker , Mira Dontcheva , Ali Rahimi , Blake Thomson , James A. Landay, Attaching UI enhancements to websites with end users, Proceedings of the 27th international conference on Human factors in computing systems, April 04-09, 2009, Boston, MA, USA
Junfeng Wang , Chun Chen , Can Wang , Jian Pei , Jiajun Bu , Ziyu Guan , Wei Vivian Zhang, Can we learn a template-independent wrapper for news article extraction from a single training site?, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
Qi Zhang , Yang Shi , Xuanjing Huang , Lide Wu, Template-independent wrapper for web forums, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA