ABSTRACT
Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
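To make the extraction and labelling steps concrete, the following is a minimal sketch of applying a regular expression wrapper to a result page and storing the extracted data objects in a labelled table. It is not DeLa's implementation: the wrapper here is written by hand rather than induced automatically, the labels are supplied directly rather than assigned by the system, and the page markup, the names `RESULT_PAGE`, `WRAPPER`, `LABELS`, `extract_rows` and `annotate`, and the sample records are all hypothetical.

```python
import re

# Hypothetical result page returned by a form query (markup is an assumption).
RESULT_PAGE = """
<html><body><table>
<tr><td><b>Database Systems</b></td><td>J. Smith</td><td>$45.00</td></tr>
<tr><td><b>Web Mining</b></td><td>A. Lee</td><td>$30.50</td></tr>
</table></body></html>
"""

# Hand-written wrapper: each match corresponds to one data object (one row).
# The groups c1..c3 are unlabelled columns at this stage.
WRAPPER = re.compile(
    r"<tr><td><b>(?P<c1>.*?)</b></td>"
    r"<td>(?P<c2>.*?)</td>"
    r"<td>(?P<c3>.*?)</td></tr>"
)

# Hypothetical column labels; supplied by hand in this sketch.
LABELS = {"c1": "Title", "c2": "Author", "c3": "Price"}


def extract_rows(page):
    """Apply the wrapper to a page and return one dict per data object."""
    return [match.groupdict() for match in WRAPPER.finditer(page)]


def annotate(rows, labels):
    """Rename the unlabelled columns of each row using the given labels."""
    return [{labels[col]: value for col, value in row.items()} for row in rows]


table = annotate(extract_rows(RESULT_PAGE), LABELS)
for row in table:
    print(row)
```

A hand-written pattern like this is tied to one page layout; a wrapper induced per site would take the place of the fixed `WRAPPER` pattern.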