Automatic wrapper induction from hidden-web sources with domain knowledge
Source
Workshop on Web Information and Data Management
Proceeding of the 10th ACM workshop on Web information and data management
Napa Valley, California, USA
SESSION: Data mining and clustering
Pages 9-16
Year of Publication: 2008
ISBN: 978-1-60558-260-3
Authors
Pierre Senellart  INRIA Saclay & TELECOM ParisTech, Paris, France
Avin Mittal  Indian Institute of Technology, Bombay, India
Daniel Muschick  Technische Universität Graz, Graz, Austria
Rémi Gilleron  Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France
Marc Tommasi  Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 171,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1458502.1458505

ABSTRACT

We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach needs only domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from hidden-Web sources involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the first task, we combine heuristics with probing of the form using domain instances; for the second, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation produced from the domain knowledge. We report experiments that demonstrate the validity and potential of the approach.
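
As an illustration of the annotation step mentioned in the abstract, the following is a minimal sketch of how text nodes of an HTML result page might be labeled automatically from domain knowledge. It is not the authors' implementation: the `DOMAIN` dictionary, its example instances, the `GazetteerAnnotator` class, the `annotate` function, and the use of exact case-insensitive matching are all assumptions made for this example. Such an annotation is expected to be imperfect, since unknown values stay unlabeled, which is why the abstract describes feeding it to a learning technique afterwards.

```python
from html.parser import HTMLParser

# Hypothetical domain knowledge: concept name -> known instances (lowercased).
DOMAIN = {
    "author": {"pierre senellart", "marc tommasi"},
    "year": {"2007", "2008"},
}


class GazetteerAnnotator(HTMLParser):
    """Label each text node with the first concept that lists it as an instance."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.path = []          # stack of currently open tags
        self.annotations = []   # (tag path, text, concept or None)

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Assumption: exact, case-insensitive match against known instances.
        concept = next(
            (name for name, instances in self.domain.items()
             if text.lower() in instances),
            None,
        )
        self.annotations.append(("/".join(self.path), text, concept))


def annotate(html, domain=DOMAIN):
    """Return a list of (tag path, text, concept) triples for one page."""
    parser = GazetteerAnnotator(domain)
    parser.feed(html)
    parser.close()
    return parser.annotations
```

Calling `annotate` on a small result table returns one triple per non-empty text node, with `None` as the concept for values the domain knowledge does not cover. The tag path is kept alongside each label because a tree-based learner would need the structural position of a node, not only its text.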


