ABSTRACT
Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
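The input/output contract described above (a set of template-generated pages in, a deduced template and the encoded values out) can be illustrated with a toy sketch. This is not the algorithm presented in the paper: it assumes, purely for illustration, that a template token is any token occurring exactly once in every page, and that values are whatever lies between consecutive template tokens. The function names `infer_template` and `extract_values` and the sample pages are hypothetical.

```python
from collections import Counter


def infer_template(pages):
    """Guess the template: tokens that occur exactly once in every page.

    Toy heuristic (an assumption of this sketch, not the paper's method).
    """
    counts = [Counter(page) for page in pages]
    template = [tok for tok in pages[0] if all(c[tok] == 1 for c in counts)]
    anchors = set(template)
    # The guessed template tokens must appear in the same order in every page.
    for page in pages:
        if [tok for tok in page if tok in anchors] != template:
            raise ValueError("template tokens are not in a consistent order")
    return template


def extract_values(page, template):
    """Split a page at the template tokens and return the non-empty gaps."""
    anchors = set(template)
    values, current = [], []
    for tok in page:
        if tok in anchors:
            if current:
                values.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages, already tokenized on whitespace.
pages = [
    "<html> <b> Title: </b> Databases <i> Author: </i> Jeff Ullman </html>".split(),
    "<html> <b> Title: </b> Compilers <i> Author: </i> Alfred Aho </html>".split(),
]

template = infer_template(pages)
rows = [extract_values(page, template) for page in pages]
for row in rows:
    print(row)
```

The heuristic is deliberately naive: a value that happens to appear once in every page would be mistaken for template text, and empty or optional values would shift the columns. Handling such cases is exactly what a formal template model is needed for.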
CITED BY 83
D. C. Reis, P. B. Golgher, A. S. Silva, A. F. Laender, Automatic web news extraction using tree edit distance, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
Ling Ma, Nazli Goharian, Abdur Chowdhury, Misun Chung, Extracting unstructured data from template generated web documents, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
Hanny Yulius Limanto, Nguyen Ngoc Giang, Vo Tan Trung, Jun Zhang, Qi He, Nguyen Quang Huy, An information extraction engine for web discussion forums, Special interest tracks and posters of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
Hongkun Zhao, Weiyi Meng, Zonghuan Wu, Vijay Raghavan, Clement Yu, Fully automatic wrapper generation for search engines, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
Márcio L. A. Vidal, Altigran S. da Silva, Edleno S. de Moura, João M. B. Cavalcanti, Structure-driven crawler generation by example, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 06-11, 2006, Seattle, Washington, USA
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma, Simultaneous record detection and attribute labeling in web data extraction, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
Eli Cortez, Altigran S. da Silva, Marcos André Gonçalves, Filipe Mesquita, Edleno S. de Moura, FLUX-CIM: flexible unsupervised extraction of citation metadata, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
R. Alonso-Calvo, V. Maojo, H. Billhardt, F. Martin-Sanchez, M. García-Remesal, D. Pérez-Rey, An agent- and ontology-based system for integrating public gene, protein, and disease databases, Journal of Biomedical Informatics, v.40 n.1, p.17-29, February, 2007
Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma, Instance-based schema matching for web databases by domain-specific query probing, Proceedings of the Thirtieth international conference on Very large data bases, p.408-419, August 31-September 03, 2004, Toronto, Canada
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo, Paolo Missier, An automatic data grabber for large web sites, Proceedings of the Thirtieth international conference on Very large data bases, p.1321-1324, August 31-September 03, 2004, Toronto, Canada
Jun Zhu, Bo Zhang, Zaiqing Nie, Ji-Rong Wen, Hsiao-Wuen Hon, Webpage understanding: an integrated approach, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
Alberto Pan, Juan Raposo, Manuel Álvarez, Víctor Carneiro, Fernando Bellas, Automatically maintaining navigation sequences for querying semi-structured web sources, Data & Knowledge Engineering, v.63 n.3, p.795-810, December, 2007
Eunyee Koh, Daniel Caruso, Andruid Kerne, Ricardo Gutierrez-Osuna, Elimination of junk document surrogate candidates through pattern recognition, Proceedings of the 2007 ACM symposium on Document engineering, August 28-31, 2007, Winnipeg, Manitoba, Canada
Shuyi Zheng, Ruihua Song, Ji-Rong Wen, Di Wu, Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
Jose Iria, Victoria Uren, Alberto Lavelli, Sebastian Blohm, Aba-sah Dadzie, Thomas Franz, Ioannis Kompatsiaris, Joao Magalhaes, Spiros Nikolopoulos, Christine Preisach, Piercarlo Slavazza, Enhancing enterprise knowledge processes via cross-media extraction, Proceedings of the 4th international conference on Knowledge capture, October 28-31, 2007, Whistler, BC, Canada
Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda, Extracting lists of data records from semi-structured web pages, Data & Knowledge Engineering, v.64 n.2, p.491-509, February, 2008
Shuyi Zheng, Matthew R. Scott, Ruihua Song, Ji-Rong Wen, Pictor: an interactive system for importing data from a website, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti, Supporting the automatic construction of entity aware search engines, Proceeding of the 10th ACM workshop on Web information and data management, October 30-30, 2008, Napa Valley, California, USA
Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser, Extracting data records from the web using tag path clustering, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain
Shuyi Zheng, Ruihua Song, Ji-Rong Wen, Template-independent news extraction based on visual consistency, Proceedings of the 22nd national conference on Artificial intelligence, p.1507-1512, July 22-26, 2007, Vancouver, British Columbia, Canada
Junfeng Wang, Chun Chen, Can Wang, Jian Pei, Jiajun Bu, Ziyu Guan, Wei Vivian Zhang, Can we learn a template-independent wrapper for news article extraction from a single training site?, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France