A flexible learning system for wrapping tables and lists in HTML documents
Source: International World Wide Web Conference archive
Proceedings of the 11th International Conference on World Wide Web
Honolulu, Hawaii, USA
Session: Extraction and Visualization
Pages: 232-241
Year of Publication: 2002
ISBN: 1-58113-449-5
Authors
William W. Cohen  WhizBang Labs, Pittsburgh, PA
Matthew Hurst  WhizBang Labs, Pittsburgh, PA
Lee S. Jensen  WhizBang Labs, Pittsburgh, PA
Sponsors
ACM: Association for Computing Machinery
: WWW'02
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/511446.511477

ABSTRACT

A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. Examples of such different representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular, and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
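
To make the term "wrapper" concrete, the following is a minimal hand-written sketch, not the WL2 system and not a learned wrapper. It assumes a page whose data sits in a single HTML table with a header row; the class name `TableWrapper`, the function `wrap`, and the sample page are all hypothetical and chosen only for illustration. It shows the end result a wrapper is meant to produce: database-like records extracted from markup.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hypothetical hand-coded wrapper: collects table rows as tuples of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, one tuple per <tr>
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None


def wrap(html):
    """Return the table as a list of records keyed by the header row."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    header, *records = parser.rows
    return [dict(zip(header, record)) for record in records]


# Illustrative page content (made up for this sketch).
PAGE = """
<table>
  <tr><th>name</th><th>city</th></tr>
  <tr><td>Ada</td><td>London</td></tr>
  <tr><td>Alan</td><td>Manchester</td></tr>
</table>
"""

records = wrap(PAGE)
for record in records:
    print(record)
```

A wrapper like this is tied to one page layout and breaks when the layout changes, which is the maintenance burden that learning wrappers from examples, as the abstract describes, is intended to reduce.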

