ABSTRACT
Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different Web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.