The Lixto data extraction project: back and forth between theory and practice
Source: Symposium on Principles of Database Systems
Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Paris, France
SESSION: Invited talk
Pages: 1 - 12
Year of Publication: 2004
ISBN: 158113858X
Authors
Georg Gottlob  DBAI, TU Wien, Austria
Christoph Koch  DBAI, TU Wien, Austria
Robert Baumgartner  Lixto Software GmbH, Austria
Marcus Herzog  Lixto Software GmbH, Austria
Sergio Flesca  D.E.I.S. - Università della Calabria, Italy
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGMOD: ACM Special Interest Group on Management of Data
SIGART: ACM Special Interest Group on Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12, Downloads (12 Months): 68, Citation Count: 13
DOI: http://doi.acm.org/10.1145/1055558.1055560

ABSTRACT

We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative, which is used as the internal wrapper language in the Lixto system. These results characterize both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inspired by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.


