Data extraction and label assignment for web databases
Source: International World Wide Web Conference
Proceedings of the 12th International Conference on World Wide Web
Budapest, Hungary
SESSION: Establishing the semantic web 1
Pages: 187 - 196  
Year of Publication: 2003
ISBN:1-58113-680-3
Authors
Jiying Wang, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Fred H. Lochovsky, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775152.775179

ABSTRACT

Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute of wrapper induction per site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
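
To make the idea of a regular expression wrapper concrete, the following is a minimal sketch in Python, not DeLa's actual method. It assumes a hypothetical result page and uses a hand-written wrapper and hand-assigned column labels, whereas the abstract describes inducing the wrapper and assigning the labels automatically. The names RESULT_PAGE, WRAPPER, LABELS and extract are invented for this illustration.

```python
import re

# Hypothetical result page returned by a form query; the markup and
# the records are invented for illustration only.
RESULT_PAGE = """
<ul>
<li><b>Database Systems</b> <i>Ullman</i> $45.00</li>
<li><b>Web Data Mining</b> <i>Liu</i> $60.50</li>
</ul>
"""

# A hand-written regular expression wrapper (assumption: a real system
# would generate this pattern from the page itself). Each match is one
# data object; each capture group is one attribute value.
WRAPPER = re.compile(r"<li><b>(.*?)</b>\s*<i>(.*?)</i>\s*\$([\d.]+)</li>")

# Column labels supplied by hand here (assumption: label assignment is
# a separate, automatic step in the system the abstract describes).
LABELS = ("title", "author", "price")


def extract(page):
    """Apply the wrapper to a page and return a list of labelled rows."""
    return [dict(zip(LABELS, m.groups())) for m in WRAPPER.finditer(page)]


rows = extract(RESULT_PAGE)
for row in rows:
    print(row)
```

Each dictionary in rows corresponds to one row of the labelled table, with the capture groups of the wrapper playing the role of columns.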

