ABSTRACT
A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. Examples of such representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.