| Automatic wrapper induction from hidden-web sources with domain knowledge |
Source
|
Workshop On Web Information And Data Management
Proceeding of the 10th ACM workshop on Web information and data management
Napa Valley, California, USA
SESSION: Data mining and clustering
Pages 9-16
Year of Publication: 2008
ISBN: 978-1-60558-260-3
Authors
Pierre Senellart (INRIA Saclay & TELECOM ParisTech, Paris, France)
Avin Mittal (Indian Institute of Technology, Bombay, India)
Daniel Muschick (Technische Universität Graz, Graz, Austria)
Rémi Gilleron (Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France)
Marc Tommasi (Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France)
ABSTRACT
We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach needs only domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from hidden-Web sources involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former, we combine heuristics with probing of the form using domain instances; for the latter, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation obtained from the domain knowledge. We report experiments that demonstrate the validity and potential of the approach.